Our interest lies in direct marketing campaigns. We want to measure how effective these campaigns are and to predict whether a customer will subscribe to a term deposit as a result of one. We also want to predict how many campaign contacts it takes before a customer subscribes. Finally, we want to learn whether other attributes, such as job, age, balance, and loan, affect whether a customer subscribes.
We build models to predict whether a customer will subscribe to a term deposit. Once a model is built, we look for patterns that identify the customers who are more likely to subscribe. The bank can then concentrate its staff on these target customers instead of spending effort on unlikely prospects, which should raise the success rate of its marketing campaigns.
The data is related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The file used here, bank-full.csv, consists of 45,211 observations with 16 input attributes plus the target y. The attributes cover bank client data, data related to the last contact of the current campaign, and other attributes such as the outcome of the previous campaign.
age: the customer's age
balance: the customer's account balance
duration: last contact duration, in seconds.
campaign: number of contacts performed during this campaign and for this client.
pdays: number of days that passed after the client was last contacted in a previous campaign (-1 means the client was not previously contacted)
previous: number of contacts performed before this campaign and for this client
day: last contact day of the month
job: type of job
marital: marital status
education: education level
default: has credit in default?
housing: has a housing loan?
loan: has a personal loan?
contact: contact communication type
month: last contact month of year
poutcome: outcome of the previous marketing campaign
y: has the client subscribed to a term deposit? (binary: ‘yes’, ‘no’)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.4 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 2.0.1 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
summarize_numeric = function(dataset) {
dataset = select_if(dataset, is.numeric)
summary.table = data.frame(Attribute = names(dataset))
summary.table = summary.table %>%
mutate('Missing Values' = apply(dataset, 2, function (x) sum(is.na(x))),
'Unique Values' = apply(dataset, 2, function (x) length(unique(x))),
'Mean' = colMeans(dataset, na.rm = TRUE),
'Min' = apply(dataset, 2, function (x) min(x, na.rm = TRUE)),
'Max' = apply(dataset, 2, function (x) max(x, na.rm = TRUE)),
'SD' = apply(dataset, 2, function (x) sd(x, na.rm = TRUE))
)
summary.table
}
summarize_character = function(dataset) {
dataset = select_if(dataset, is.character)
summary.table = data.frame(Attribute = names(dataset))
summary.table = summary.table %>%
mutate('Missing Values' = apply(dataset, 2, function (x) sum(is.na(x))),
'Unique Values' = apply(dataset, 2, function (x) length(unique(x))),
)
summary.table
}
Give a summary view of the data.
bank = read_csv('bank-full.csv',show_col_types = FALSE)
sc_bank <- summarize_character(bank)
sn_bank<- summarize_numeric(bank) %>% mutate_if(is.numeric, round, digits = 2)
library(knitr)
knitr::kable(sn_bank,"simple")
| Attribute | Missing Values | Unique Values | Mean | Min | Max | SD |
|---|---|---|---|---|---|---|
| age | 0 | 77 | 40.94 | 18 | 95 | 10.62 |
| balance | 0 | 7168 | 1362.27 | -8019 | 102127 | 3044.77 |
| day | 0 | 31 | 15.81 | 1 | 31 | 8.32 |
| duration | 0 | 1573 | 258.16 | 0 | 4918 | 257.53 |
| campaign | 0 | 48 | 2.76 | 1 | 63 | 3.10 |
| pdays | 0 | 559 | 40.20 | -1 | 871 | 100.13 |
| previous | 0 | 41 | 0.58 | 0 | 275 | 2.30 |
knitr::kable(sc_bank,"simple")
| Attribute | Missing Values | Unique Values |
|---|---|---|
| job | 0 | 12 |
| marital | 0 | 3 |
| education | 0 | 4 |
| default | 0 | 2 |
| housing | 0 | 2 |
| loan | 0 | 2 |
| contact | 0 | 3 |
| month | 0 | 12 |
| poutcome | 0 | 4 |
| y | 0 | 2 |
bank = bank %>% mutate(job = as.factor(job),
marital = as.factor(marital),
education= as.factor(education),
default = as.factor(default),
housing = as.factor(housing),
loan = as.factor(loan),
contact = as.factor(contact),
month = as.factor(month),
poutcome = as.factor(poutcome),
y = as.factor(y))
colnames(bank %>% select_if(is.factor))
## [1] "job" "marital" "education" "default" "housing" "loan"
## [7] "contact" "month" "poutcome" "y"
colnames(bank %>% select_if(is.numeric))
## [1] "age" "balance" "day" "duration" "campaign" "pdays" "previous"
There are 10 categorical attributes: job, marital, education, default, housing, loan, contact, month, poutcome, and y. There are 7 numeric attributes (measures): age, balance, day, duration, campaign, pdays, and previous. The summary tables also show that the data is quite clean: no attribute has missing values. Every attribute has at least 2 unique values, so none is constant and we do not need to drop any attribute at this point.
First, we plot the distributions of the numeric attributes and find that the values of balance, pdays, and previous are highly concentrated. Most balance values are close to 0, the pdays values are concentrated at -1, and the previous values are concentrated at 0. The previous attribute is the number of contacts performed before this campaign for this client. The pdays attribute is the number of days since the client was last contacted in a previous campaign, with -1 meaning the client was not previously contacted. The two attributes are therefore highly correlated: customers with previous equal to 0 are the same customers who have pdays equal to -1, namely those who were never contacted before. We can therefore separate the data into customers who were contacted before and customers who were not, and get more informative distribution plots of pdays and previous by restricting them to the customers who were contacted before.
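The claim that pdays equal to -1 and previous equal to 0 mark the same customers can be checked directly before splitting the data. The following is a minimal sketch on an invented toy data frame (`toy` and its values are assumptions for illustration, not rows from bank-full.csv); the same comparison could be run on `bank`.

```r
# Toy stand-in for the bank data (values are invented for illustration).
toy <- data.frame(
  pdays    = c(-1, -1, 92, 181, -1, 5),
  previous = c( 0,  0,  2,   1,  0, 3)
)

# TRUE when "never contacted" is encoded consistently in both columns.
consistent <- all((toy$pdays == -1) == (toy$previous == 0))
print(consistent)

# Cross-tabulate the two encodings; off-diagonal cells would flag mismatches.
print(table(pdays_missing = toy$pdays == -1, previous_zero = toy$previous == 0))

# Keep only customers who were contacted in an earlier campaign.
toy_contacted <- toy[toy$previous != 0, ]
```

If the check returned FALSE on the real data, the off-diagonal cells of the table would show how many rows disagree.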
p1 = ggplot(bank) + geom_bar(aes(x = age))
p2 = ggplot(bank) + geom_bar(aes(x = balance), width = 500)
p3 = ggplot(bank) + geom_bar(aes(x = day))
p4 = ggplot(bank) + geom_bar(aes(x = duration))
p5 = ggplot(bank) + geom_bar(aes(x = campaign))
p6 = ggplot(bank) + geom_bar(aes(x = pdays))
p7 = ggplot(bank) + geom_bar(aes(x = previous))
grid.arrange(p1, p2, p3, p4, p5, p6, p7, nrow=4, top = "Numeric Attributes for all the customers")
## Warning: position_stack requires non-overlapping x intervals
bank_contacted <- bank[bank$previous!=0,]
Now we draw the distribution plot for contacted customers.
p1 = ggplot(bank_contacted) + geom_bar(aes(x = age))
p2 = ggplot(bank_contacted) + geom_bar(aes(x = balance))
p3 = ggplot(bank_contacted) + geom_bar(aes(x = day))
p4 = ggplot(bank_contacted) + geom_bar(aes(x = duration))
p5 = ggplot(bank_contacted) + geom_bar(aes(x = campaign))
p6 = ggplot(bank_contacted) + geom_bar(aes(x = pdays))
p7 = ggplot(bank_contacted) + geom_bar(aes(x = previous))
grid.arrange(p1, p2, p3, p4, p5, p6, p7, nrow=4, top = "Numeric Attributes for contacted customers")
After plotting the categorical distributions, we find that a large number of customers are blue-collar, management, or technician workers. Most customers do not have credit in default. Moreover, most customers were contacted by cellular phone.
p8 = ggplot(bank) + geom_bar(aes(x = job)) + theme(axis.text.x = element_text(angle=20, hjust = 1, size=8))
p9 = ggplot(bank) + geom_bar(aes(x = marital))
p10 = ggplot(bank) + geom_bar(aes(x = education))
p11 = ggplot(bank) + geom_bar(aes(x = default))
p12 = ggplot(bank) + geom_bar(aes(x = housing))
p13 = ggplot(bank) + geom_bar(aes(x = loan))
p14 = ggplot(bank) + geom_bar(aes(x = contact))
p15 = ggplot(bank) + geom_bar(aes(x = month))
p16 = ggplot(bank) + geom_bar(aes(x = poutcome))
p17 = ggplot(bank) + geom_bar(aes(x = y))
grid.arrange(p8, p9, p10, p11, p12, p13, p14, p15, p16, p17, nrow=5, top = "Categorical Attributes")
From the correlation matrix, we find that the correlation between pdays and previous is quite high. For the rest of the attributes the correlations are quite low, which is good because it suggests little redundancy among the numeric predictors.
library(ggcorrplot)
fullCorrMatrix = round(cor(bank %>% select_if(is.numeric)), 2)
ggcorrplot(fullCorrMatrix, type = "lower", outline.col = "white",lab = TRUE)
There is no linear relationship among most numeric attributes. However, ‘duration’, ‘campaign’, and ‘previous’ tend to be lower when ‘balance’ is higher. The same pattern appears between ‘duration’ and ‘pdays’.
library(gridExtra)
library(ggcorrplot)
bank <- filter(bank, duration > 0 & pdays < 999)
#balance
gg8 = ggplot(bank) + geom_point(aes(x=`balance`, y = `duration`))
gg9 = ggplot(bank) + geom_point(aes(x=`balance`, y = `campaign`))
gg11 = ggplot(bank) + geom_point(aes(x=`balance`, y = `previous`))
grid.arrange( gg8, gg9, gg11,nrow=3)
#day
gg12 = ggplot(bank) + geom_point(aes(x=`day`, y = `duration`))
gg14 = ggplot(bank) + geom_point(aes(x=`day`, y = `pdays`))
grid.arrange(gg12, gg14, nrow=2)
#duration
gg17 = ggplot(bank) + geom_point(aes(x=`duration`, y = `pdays`))
grid.arrange( gg17)
#campaign
gg19 = ggplot(bank) + geom_point(aes(x=`campaign`, y = `pdays`))
grid.arrange(gg19)
Last contact month of the year is related to the client’s job: management jobs account for a large proportion of the contacts in each month. Job is also related to education level.
### job by Category
g3 = ggplot(bank) + geom_bar(aes(x=education, fill = job), position = "fill") + labs(y = "Percent")
g8 = ggplot(bank) + geom_bar(aes(x=month, fill = job), position = "fill") + labs(y = "Percent")
grid.arrange(g3, g8, nrow=2, top = "job by Category")
### education by Category
g1 = ggplot(bank) + geom_bar(aes(x=job, fill =education), position = "fill") + labs(y = "Percent")
grid.arrange(g1, nrow=1, top = "education by Category")
There is no apparent relationship or unexpected observation among the other categorical attributes.
Looking at the distributions of the measures across the category values, nothing in this dataset is especially surprising or unexpected.
When the outcome of the previous marketing campaign is success, the number of days since the client was last contacted tends to be smaller than for clients whose previous outcome is failure or other.
cm18 = ggplot(bank) + geom_boxplot(aes(x=poutcome, y = pdays)) + theme(axis.title.y = element_blank())
grid.arrange(cm18, nrow=1, top = "pdays by Category")
There is some relationship between age and both marital status and contact method.
cm51 = ggplot(bank) + geom_boxplot(aes(x=marital, y = age)) + theme(axis.title.y = element_blank())
cm56 = ggplot(bank) + geom_boxplot(aes(x=contact, y = age)) + theme(axis.title.y = element_blank())
grid.arrange(cm51,cm56, nrow=1, top = "age by Category")
# splits 70% of the data selected randomly into training set and the remaining 30% sample into test data set.
train_sub <- sample(nrow(bank),0.7*nrow(bank))
train_bank <-bank[train_sub,]
test_bank <-bank[-train_sub,]
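Because `sample()` is called here without a seed, every run produces a different 70/30 partition and slightly different model results. The following is a minimal sketch of a repeatable split on an invented toy data frame; the seed value, the helper `split_rows`, and the `toy` data are assumptions for illustration only.

```r
set.seed(42)  # arbitrary seed, chosen only for illustration

# Toy data frame standing in for `bank` (invented values, 12% "yes").
toy <- data.frame(id = 1:100,
                  y  = factor(rep(c("no", "yes"), times = c(88, 12))))

# Hypothetical helper: draw the row indices of the training part.
split_rows <- function(n, train_frac = 0.7) {
  sample(n, size = floor(train_frac * n))
}

train_idx <- split_rows(nrow(toy))
toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# The two parts are disjoint and together cover every row.
stopifnot(nrow(toy_train) + nrow(toy_test) == nrow(toy))

# Compare class shares to see whether the split kept the imbalance roughly intact.
print(prop.table(table(toy_train$y)))
print(prop.table(table(toy_test$y)))
```

With a target this imbalanced, comparing the class shares of the two parts is a cheap sanity check before fitting any model.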
#pairs(bank)
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:gridExtra':
##
## combine
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
#find the best 'mtry'(Number of variables available for splitting at each tree node) :
rf_bank_train1<-randomForest(y~.,data=train_bank,importance=TRUE, mtry=2, na.action = na.pass)
rf_bank_train1
##
## Call:
## randomForest(formula = y ~ ., data = train_bank, importance = TRUE, mtry = 2, na.action = na.pass)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 9.57%
## Confusion matrix:
## no yes class.error
## no 27539 440 0.01572608
## yes 2589 1077 0.70621931
rf_bank_train2<-randomForest(y~.,data=train_bank,importance=TRUE, mtry=3, na.action = na.pass)
rf_bank_train2
##
## Call:
## randomForest(formula = y ~ ., data = train_bank, importance = TRUE, mtry = 3, na.action = na.pass)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 9.07%
## Confusion matrix:
## no yes class.error
## no 27138 841 0.03005826
## yes 2030 1636 0.55373704
rf_bank_train3<-randomForest(y~.,data=train_bank,importance=TRUE, mtry=4, na.action = na.pass)
rf_bank_train3
##
## Call:
## randomForest(formula = y ~ ., data = train_bank, importance = TRUE, mtry = 4, na.action = na.pass)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 9.04%
## Confusion matrix:
## no yes class.error
## no 27006 973 0.03477608
## yes 1889 1777 0.51527550
From these results, mtry = 4 gives the lowest OOB estimate of the error rate, 9.04%.
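The three manual fits above can be collapsed into a loop over candidate mtry values. The following is a minimal sketch on synthetic data, assuming the `randomForest` package is installed; the data frame `toy`, the grid `mtry_grid`, and the reduced `ntree = 200` are illustrative choices, not the settings used in this report.

```r
library(randomForest)

set.seed(1)  # arbitrary seed for a repeatable illustration

# Synthetic two-class data standing in for `train_bank` (invented values).
n <- 300
toy <- data.frame(
  x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n), x5 = rnorm(n)
)
toy$y <- factor(ifelse(toy$x1 + 0.5 * toy$x2 + rnorm(n) > 0.8, "yes", "no"))

mtry_grid <- 2:4

# Fit one forest per candidate and record its final out-of-bag error rate.
oob_error <- sapply(mtry_grid, function(m) {
  fit <- randomForest(y ~ ., data = toy, mtry = m, ntree = 200)
  fit$err.rate[fit$ntree, "OOB"]
})
names(oob_error) <- mtry_grid

print(round(oob_error, 4))
best_mtry <- mtry_grid[which.min(oob_error)]
print(best_mtry)
```

Differences of a few hundredths of a percentage point in OOB error, such as 9.07% versus 9.04% above, are within run-to-run noise, so the selected value should be read as approximate.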
#plot the number of trees
plot(rf_bank_train3)
#the importance of variables
rf_bank_train3$importance
## no yes MeanDecreaseAccuracy MeanDecreaseGini
## age 7.496521e-03 0.0060447612 7.326765e-03 569.24017
## job 3.893014e-03 -0.0023008840 3.175810e-03 436.47141
## marital 5.009627e-04 0.0052322811 1.048044e-03 125.44960
## education 1.217041e-03 0.0003085425 1.111189e-03 162.55752
## default -2.461212e-06 0.0002622862 2.825892e-05 10.06423
## balance 7.890527e-04 0.0045177634 1.220080e-03 609.06266
## housing 5.960913e-03 0.0106493644 6.501830e-03 122.25261
## loan 7.926129e-05 0.0029164460 4.078979e-04 49.86678
## contact 3.702761e-02 0.0037847642 3.318247e-02 114.73747
## day 1.851379e-02 0.0023029936 1.664114e-02 511.99858
## month 7.063417e-02 0.0196806179 6.474106e-02 754.47495
## duration 2.077674e-02 0.1963493248 4.108814e-02 1783.61650
## campaign 1.690049e-03 0.0034100508 1.887987e-03 226.13347
## pdays 2.269822e-02 0.0227813053 2.270797e-02 273.55473
## previous 1.381749e-02 0.0084270203 1.319319e-02 131.79404
## poutcome 1.988401e-02 0.0222062205 2.015739e-02 454.84322
#plot the importance
varImpPlot(rf_bank_train3, main = "variable importance")
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(randomForest)
#Predicting in test data set
Predict_rf <- predict(rf_bank_train3, newdata=test_bank, type = "class")
rf_cf <- caret::confusionMatrix(as.factor(Predict_rf),as.factor(test_bank$y) )
rf_cf
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 11514 854
## yes 426 769
##
## Accuracy : 0.9056
## 95% CI : (0.9006, 0.9105)
## No Information Rate : 0.8803
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4945
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9643
## Specificity : 0.4738
## Pos Pred Value : 0.9310
## Neg Pred Value : 0.6435
## Prevalence : 0.8803
## Detection Rate : 0.8489
## Detection Prevalence : 0.9119
## Balanced Accuracy : 0.7191
##
## 'Positive' Class : no
##
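Because caret reports ‘no’ as the positive class, the figures that matter for the campaign (how many actual subscribers the model finds) appear under Specificity and Neg Pred Value. The sketch below recomputes them by hand from the counts printed above; the object names are assumptions introduced for illustration.

```r
# Test-set counts copied from the random forest confusion matrix above.
cm <- matrix(c(11514, 426, 854, 769), nrow = 2,
             dimnames = list(Prediction = c("no", "yes"),
                             Reference  = c("no", "yes")))

accuracy      <- sum(diag(cm)) / sum(cm)
recall_yes    <- cm["yes", "yes"] / sum(cm[, "yes"])  # share of subscribers caught
precision_yes <- cm["yes", "yes"] / sum(cm["yes", ])  # share of "yes" calls that are right
baseline      <- max(colSums(cm)) / sum(cm)           # accuracy of always predicting "no"

print(round(c(accuracy = accuracy, recall_yes = recall_yes,
              precision_yes = precision_yes, baseline = baseline), 4))
```

The recall on ‘yes’ matches the Specificity line of the caret output and the baseline matches the No Information Rate, which shows that the high accuracy is driven largely by the majority class.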
#boosting
set.seed(1)
library(gbm)
## Loaded gbm 2.1.8
library(survival)
##
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
##
## cluster
train_bank$y = ifelse(train_bank$y == "yes",1,0)
bank_gb = gbm(y~.,distribution = "bernoulli",data = train_bank,n.trees = 500,interaction.depth = 4,cv.folds = 3)
summary(bank_gb)
| | var | rel.inf |
|---|---|---|
| duration | duration | 35.3469104 |
| month | month | 20.6717401 |
| poutcome | poutcome | 13.8235783 |
| job | job | 5.6459656 |
| age | age | 5.0004557 |
| day | day | 4.7525821 |
| balance | balance | 3.2943917 |
| contact | contact | 3.0841049 |
| pdays | pdays | 2.9610691 |
| housing | housing | 2.3891585 |
| campaign | campaign | 0.7379019 |
| marital | marital | 0.7315589 |
| education | education | 0.6833697 |
| loan | loan | 0.4300640 |
| previous | previous | 0.4036700 |
| default | default | 0.0434793 |
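#confusion matrix for the boosted model
#predicted probabilities of "yes" from the boosted model
yhat_boost = predict(bank_gb, newdata = test_bank, n.trees = 500, type = "response")
#classify as "yes" above the 0.5 cut-off (explicit levels keep both rows in the table)
pred_boost = factor(ifelse(yhat_boost > 0.5, "yes", "no"), levels = c("no", "yes"))
boost_err = table(pred = pred_boost, truth = test_bank$y)
colnames(boost_err) <- c("No","Yes")
boost_err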
## 70% of the sample size
smp_size <- floor(0.7 * nrow(bank))
## set the seed to make your partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(bank)), size = smp_size)
train <- bank[train_ind, ]
test <- bank[-train_ind, ]
library(rpart)
library(rpart.plot)
ct <- rpart.control(xval=10, minsplit=20, cp=0.01)
cfit <- rpart(y~.,
data=train, method="class", control=ct,
parms=list(split="gini")
)
rpart.plot(cfit, main="Decision Tree")
library(tree)
## Registered S3 method overwritten by 'tree':
## method from
## print.tree cli
summary(tree(y~., data=train, method = "class"))
##
## Classification tree:
## tree(formula = y ~ ., data = train, method = "class")
## Variables actually used in tree construction:
## [1] "duration" "poutcome" "month" "contact"
## Number of terminal nodes: 9
## Residual mean deviance: 0.4879 = 15440 / 31640
## Misclassification error rate: 0.1095 = 3465 / 31645
p <- predict(cfit, test ,type="class")
table(p, test$y)
##
## p no yes
## no 11706 1013
## yes 321 523
# splits 70% of the data selected randomly into training set and the remaining 30% sample into test data set.
dt = sort(sample(nrow(bank), nrow(bank)*.7))
train<-bank[dt,]
test<-bank[-dt,]
mylogit <- glm(y ~., data = train, family = "binomial")
summary(mylogit)
##
## Call:
## glm(formula = y ~ ., family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.6779 -0.3733 -0.2505 -0.1481 3.4237
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.353e+00 2.194e-01 -10.724 < 2e-16 ***
## age -1.364e-03 2.653e-03 -0.514 0.607229
## jobblue-collar -3.820e-01 8.702e-02 -4.390 1.13e-05 ***
## jobentrepreneur -3.622e-01 1.484e-01 -2.440 0.014688 *
## jobhousemaid -5.305e-01 1.617e-01 -3.280 0.001037 **
## jobmanagement -1.757e-01 8.724e-02 -2.014 0.044031 *
## jobretired 3.204e-01 1.152e-01 2.781 0.005426 **
## jobself-employed -3.168e-01 1.336e-01 -2.372 0.017708 *
## jobservices -3.503e-01 1.021e-01 -3.430 0.000605 ***
## jobstudent 3.905e-01 1.272e-01 3.071 0.002133 **
## jobtechnician -2.166e-01 8.191e-02 -2.645 0.008180 **
## jobunemployed -2.509e-01 1.326e-01 -1.892 0.058454 .
## jobunknown -4.379e-01 2.767e-01 -1.583 0.113518
## maritalmarried -1.842e-01 7.072e-02 -2.604 0.009204 **
## maritalsingle 7.543e-02 8.066e-02 0.935 0.349673
## educationsecondary 1.677e-01 7.809e-02 2.147 0.031767 *
## educationtertiary 3.562e-01 9.031e-02 3.945 7.99e-05 ***
## educationunknown 2.685e-01 1.256e-01 2.137 0.032578 *
## defaultyes -2.201e-01 2.063e-01 -1.067 0.285993
## balance 1.946e-05 5.813e-06 3.349 0.000812 ***
## housingyes -6.642e-01 5.241e-02 -12.674 < 2e-16 ***
## loanyes -4.085e-01 7.153e-02 -5.711 1.12e-08 ***
## contacttelephone -2.517e-01 9.107e-02 -2.764 0.005705 **
## contactunknown -1.652e+00 8.855e-02 -18.658 < 2e-16 ***
## day 1.227e-02 2.985e-03 4.109 3.97e-05 ***
## monthaug -7.402e-01 9.266e-02 -7.989 1.37e-15 ***
## monthdec 6.197e-01 2.171e-01 2.854 0.004312 **
## monthfeb -1.274e-01 1.045e-01 -1.218 0.223043
## monthjan -1.207e+00 1.406e-01 -8.584 < 2e-16 ***
## monthjul -9.357e-01 9.257e-02 -10.108 < 2e-16 ***
## monthjun 4.156e-01 1.119e-01 3.714 0.000204 ***
## monthmar 1.516e+00 1.426e-01 10.634 < 2e-16 ***
## monthmay -4.393e-01 8.566e-02 -5.128 2.93e-07 ***
## monthnov -9.575e-01 1.004e-01 -9.539 < 2e-16 ***
## monthoct 8.457e-01 1.295e-01 6.529 6.63e-11 ***
## monthsep 7.651e-01 1.455e-01 5.258 1.45e-07 ***
## duration 4.162e-03 7.692e-05 54.107 < 2e-16 ***
## campaign -8.619e-02 1.196e-02 -7.208 5.67e-13 ***
## pdays -4.089e-04 3.691e-04 -1.108 0.267949
## previous 1.012e-02 6.848e-03 1.479 0.139269
## poutcomeother 2.006e-01 1.079e-01 1.859 0.062979 .
## poutcomesuccess 2.356e+00 9.868e-02 23.874 < 2e-16 ***
## poutcomeunknown -1.618e-01 1.115e-01 -1.451 0.146707
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 22872 on 31644 degrees of freedom
## Residual deviance: 15039 on 31602 degrees of freedom
## AIC: 15125
##
## Number of Fisher Scoring iterations: 6
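The coefficients in this summary are on the log-odds scale, which makes their size hard to read directly. Exponentiating them gives odds ratios, holding the other predictors fixed. The sketch below does this for a few coefficients copied from the output above; the vector `coefs` and the per-minute rescaling are illustrative additions.

```r
# Coefficients copied from the glm summary above (log-odds scale).
coefs <- c(poutcomesuccess = 2.356, housingyes = -0.6642,
           loanyes = -0.4085, duration = 4.162e-3)

# exp() turns a log-odds coefficient into an odds ratio.
odds_ratio <- exp(coefs)
print(round(odds_ratio, 3))

# duration is measured in seconds, so one extra minute multiplies the odds by:
per_minute <- exp(60 * coefs[["duration"]])
print(round(per_minute, 3))
```

Read this way, a successful previous campaign multiplies the odds of subscribing by roughly ten, while having a housing loan roughly halves them.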
library(caret)
glm.probs <- predict(mylogit,test,type = "response")
glm.pred <- ifelse(glm.probs > 0.5, "yes", "no")
glm.pred <- factor(glm.pred, levels = levels(test$y))
confusionMatrix(glm.pred,test$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 11681 1029
## yes 303 550
##
## Accuracy : 0.9018
## 95% CI : (0.8967, 0.9068)
## No Information Rate : 0.8836
## P-Value [Acc > NIR] : 7.089e-12
##
## Kappa : 0.4036
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9747
## Specificity : 0.3483
## Pos Pred Value : 0.9190
## Neg Pred Value : 0.6448
## Prevalence : 0.8836
## Detection Rate : 0.8612
## Detection Prevalence : 0.9371
## Balanced Accuracy : 0.6615
##
## 'Positive' Class : no
##
library(ROCR)
pr <- prediction(glm.probs, test$y)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
auc <- as.numeric(performance(pr, "auc")@y.values)
auc
## [1] 0.9044215
plot(prf,
lwd = 3, colorize = TRUE,
text.adj = c(-0.2, 1.7),
main = 'ROC Curve')
mtext(paste('auc : ', round(auc, 5)))
abline(0, 1, col = "red", lty = 2)
glm.pred2 <- ifelse(glm.probs > 0.09 , "yes", "no")
glm.pred2 <- factor(glm.pred2, levels = levels(test$y))
confusionMatrix(glm.pred2,test$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 9418 186
## yes 2566 1393
##
## Accuracy : 0.7971
## 95% CI : (0.7902, 0.8038)
## No Information Rate : 0.8836
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.4038
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7859
## Specificity : 0.8822
## Pos Pred Value : 0.9806
## Neg Pred Value : 0.3519
## Prevalence : 0.8836
## Detection Rate : 0.6944
## Detection Prevalence : 0.7081
## Balanced Accuracy : 0.8340
##
## 'Positive' Class : no
##
For the random forest, we try several values of ‘mtry’ (the number of variables tried at each split) and find that mtry = 4 gives the lowest out-of-bag estimate of the error rate, 9.04%. Duration is the most important factor: the longer the last contact lasts, the higher the probability that the client subscribes to the term deposit. Month and day are also important factors. The test accuracy is 90.56%, which at first glance suggests a good classifier, although it is only modestly above the no-information rate of 88.03%. The Kappa is 0.4945, which indicates moderate agreement. We then use gradient boosting to produce the relative influence plot and statistics. From these we can also conclude that duration, month, and poutcome are the three most important variables among all the predictors.
For the decision tree, duration is also the most important factor. If the duration is greater than 830 seconds, the customer has a higher probability of subscribing to the term deposit. Poutcome is also an important factor: even if the contact duration is not long enough, a customer who subscribed in the previous campaign (poutcome is success) has a high probability of subscribing to a term deposit in this campaign.
For the logistic regression, we first randomly assign 70% of the data to the training set and the remaining 30% to the test set. We then fit a logistic model with the glm() function. Judging by the p-values in the summary, duration, month, day, higher education, and campaign are highly significant in this model. On the other hand, age, pdays, and previous are not statistically significant at the 0.05 level.
After fitting the model, we first predict with a threshold of 0.5, which predicts ‘yes’ if the estimated probability is greater than the threshold. As the confusion matrix shows, the accuracy is about 90 percent, so the model appears to predict very well. However, only 550 of the 1,579 actual subscribers in the test set are identified (specificity 0.3483, because caret treats ‘no’ as the positive class).
From the ROC curve, we found that the optimal threshold is 0.09. We made a second prediction with this threshold, and the accuracy dropped to about 80 percent, which is an interesting result. With a logistic model we can adjust the threshold to suit the problem, and there is always a tradeoff between true positives and true negatives when we do so. Comparing the two matrices, the true positives (customers correctly predicted as ‘no’) decrease from 11,681 to 9,418 while the true negatives (customers correctly predicted as ‘yes’) increase from 550 to 1,393. Since we aim to find out the effectiveness of marketing campaigns and to target customers, we want to maximize the true negatives, which are the customers who actually say yes. We therefore choose 0.09 as our threshold.
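The text does not state how the 0.09 cut-off was read off the ROC curve. One common criterion, assumed here purely for illustration, is Youden's J (true positive rate plus true negative rate minus 1), maximized over a grid of thresholds. The sketch below applies it to synthetic scores; `truth` and `probs` are invented stand-ins for `test$y` and `glm.probs`, and ‘yes’ is treated as the class of interest.

```r
set.seed(7)  # arbitrary seed for a repeatable illustration

# Synthetic labels and scores standing in for `test$y` and `glm.probs`.
n     <- 2000
truth <- factor(ifelse(runif(n) < 0.12, "yes", "no"), levels = c("no", "yes"))
probs <- plogis(rnorm(n, mean = ifelse(truth == "yes", -0.5, -3), sd = 1))

thresholds <- seq(0.01, 0.99, by = 0.01)

# For each cut-off, compute recall on "yes" (TPR) and recall on "no" (TNR).
rates <- t(sapply(thresholds, function(th) {
  pred_yes <- probs > th
  c(threshold = th,
    tpr = mean(pred_yes[truth == "yes"]),
    tnr = mean(!pred_yes[truth == "no"]))
}))

# Youden's J = TPR + TNR - 1; the maximizing cut-off balances both error types.
youden   <- rates[, "tpr"] + rates[, "tnr"] - 1
best_cut <- rates[which.max(youden), "threshold"]
print(best_cut)
```

With a rare ‘yes’ class, a criterion of this kind typically lands well below 0.5, which is in line with the low threshold chosen above. A cost-based criterion that weighs a missed subscriber against a wasted call would be a reasonable alternative.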
All our models have high predictive power, driven mainly by duration, which is the length of the last contact with the customer. We suggest that the company focus on customers who subscribed to a term deposit in the previous campaign. The ideal target customer is middle-aged with higher education. We also recommend keeping the contact duration as long as possible.
From the logistic regression, we find that higher education and longer contact duration are predictive of, and positively associated with, a ‘yes’ subscription, and that 0.09 is an ideal cut-off point for this business problem. From the random forest and boosting models, we see that the client is more likely to subscribe to the term deposit when the outcome of the previous marketing campaign was a success, the last contact month is closer, and the contact lasts longer. Our decision tree model suggests that if the contact duration is longer than 830 seconds, the customer is more likely to subscribe to the term deposit.
In the end, we decided to use the logistic regression model, since its predictions are the best for our purpose: it helps the company correctly identify more subscribing customers rather than merely achieving a high accuracy rate.
library(gridExtra)
library(ggcorrplot)
bank <- filter(bank, duration > 0 & pdays < 999)
#age
gg1 = ggplot(bank) + geom_point(aes(x=`age`, y = `balance`))
gg2 = ggplot(bank) + geom_point(aes(x=`age`, y = `day`))
gg3 = ggplot(bank) + geom_point(aes(x=`age`, y = `duration`))
gg4 = ggplot(bank) + geom_point(aes(x=`age`, y = `campaign`))
gg5 = ggplot(bank) + geom_point(aes(x=`age`, y = `pdays`))
gg6 = ggplot(bank) + geom_point(aes(x=`age`, y = `previous`))
grid.arrange(gg1, gg2, gg3, gg4,gg5,gg6, nrow=3)
#balance
gg7 = ggplot(bank) + geom_point(aes(x=`balance`, y = `day`))
gg8 = ggplot(bank) + geom_point(aes(x=`balance`, y = `duration`))
gg9 = ggplot(bank) + geom_point(aes(x=`balance`, y = `campaign`))
gg10 = ggplot(bank) + geom_point(aes(x=`balance`, y = `pdays`))
gg11 = ggplot(bank) + geom_point(aes(x=`balance`, y = `previous`))
grid.arrange(gg7, gg8, gg9, gg10, gg11,nrow=3)
#day
gg12 = ggplot(bank) + geom_point(aes(x=`day`, y = `duration`))
gg13 = ggplot(bank) + geom_point(aes(x=`day`, y = `campaign`))
gg14 = ggplot(bank) + geom_point(aes(x=`day`, y = `pdays`))
gg15 = ggplot(bank) + geom_point(aes(x=`day`, y = `previous`))
grid.arrange(gg12, gg13, gg14, gg15, nrow=2)
#duration
gg16 = ggplot(bank) + geom_point(aes(x=`duration`, y = `campaign`))
gg17 = ggplot(bank) + geom_point(aes(x=`duration`, y = `pdays`))
gg18 = ggplot(bank) + geom_point(aes(x=`duration`, y = `previous`))
grid.arrange(gg16, gg17, gg18, nrow=2)
#campaign
gg19 = ggplot(bank) + geom_point(aes(x=`campaign`, y = `pdays`))
gg20 = ggplot(bank) + geom_point(aes(x=`campaign`, y = `previous`))
grid.arrange(gg19, gg20,nrow=2)
#pdays
gg21 = ggplot(bank) + geom_point(aes(x=`pdays`, y = `previous`))
#job by Category
g1 = ggplot(bank) + geom_bar(aes(x=job, fill = job), position = "fill") + labs(y = "Percent")
g2 = ggplot(bank) + geom_bar(aes(x=marital, fill = job), position = "fill") + labs(y = "Percent")
g3 = ggplot(bank) + geom_bar(aes(x=education, fill = job), position = "fill") + labs(y = "Percent")
g4 = ggplot(bank) + geom_bar(aes(x=default, fill = job), position = "fill") + labs(y = "Percent")
g5 = ggplot(bank) + geom_bar(aes(x=housing, fill = job), position = "fill") + labs(y = "Percent")
g6 = ggplot(bank) + geom_bar(aes(x=loan, fill = job), position = "fill") + labs(y = "Percent")
g7 = ggplot(bank) + geom_bar(aes(x=contact, fill = job), position = "fill") + labs(y = "Percent")
g8 = ggplot(bank) + geom_bar(aes(x=month, fill = job), position = "fill") + labs(y = "Percent")
g9 = ggplot(bank) + geom_bar(aes(x=poutcome, fill = job), position = "fill") + labs(y = "Percent")
grid.arrange(g2, g3, g4, nrow=3, top = "job by Category")
grid.arrange(g5, g6, g7, nrow=3, top = "job by Category")
grid.arrange(g8, g9, nrow=3, top = "job by Category")
#marital by Category
g1 = ggplot(bank) + geom_bar(aes(x=job, fill =marital), position = "fill") + labs(y = "Percent")
g2 = ggplot(bank) + geom_bar(aes(x=marital, fill = marital), position = "fill") + labs(y = "Percent")
g3 = ggplot(bank) + geom_bar(aes(x=education, fill = marital), position = "fill") + labs(y = "Percent")
g4 = ggplot(bank) + geom_bar(aes(x=default, fill = marital), position = "fill") + labs(y = "Percent")
g5 = ggplot(bank) + geom_bar(aes(x=housing, fill = marital), position = "fill") + labs(y = "Percent")
g6 = ggplot(bank) + geom_bar(aes(x=loan, fill = marital), position = "fill") + labs(y = "Percent")
g7 = ggplot(bank) + geom_bar(aes(x=contact, fill = marital), position = "fill") + labs(y = "Percent")
g8 = ggplot(bank) + geom_bar(aes(x=month, fill = marital), position = "fill") + labs(y = "Percent")
g9 = ggplot(bank) + geom_bar(aes(x=poutcome, fill = marital), position = "fill") + labs(y = "Percent")
grid.arrange(g1, g3, g4, nrow=3, top = "marital by Category")
grid.arrange(g5, g6, g7, nrow=3, top = "marital by Category")
grid.arrange(g8, g9, nrow=3, top = "marital by Category")
#education by Category
g1 = ggplot(bank) + geom_bar(aes(x=job, fill = education), position = "fill") + labs(y = "Percent")
g2 = ggplot(bank) + geom_bar(aes(x=marital, fill = education), position = "fill") + labs(y = "Percent")
g3 = ggplot(bank) + geom_bar(aes(x=education, fill = education), position = "fill") + labs(y = "Percent")
g4 = ggplot(bank) + geom_bar(aes(x=default, fill = education), position = "fill") + labs(y = "Percent")
g5 = ggplot(bank) + geom_bar(aes(x=housing, fill = education), position = "fill") + labs(y = "Percent")
g6 = ggplot(bank) + geom_bar(aes(x=loan, fill = education), position = "fill") + labs(y = "Percent")
g7 = ggplot(bank) + geom_bar(aes(x=contact, fill = education), position = "fill") + labs(y = "Percent")
g8 = ggplot(bank) + geom_bar(aes(x=month, fill = education), position = "fill") + labs(y = "Percent")
g9 = ggplot(bank) + geom_bar(aes(x=poutcome, fill = education), position = "fill") + labs(y = "Percent")
grid.arrange(g1, g2, g4, nrow=3, top = "education by Category")
grid.arrange(g5, g6, g7, nrow=3, top = "education by Category")
grid.arrange(g8, g9, nrow=3, top = "education by Category")
#default by Category
g1 = ggplot(bank) + geom_bar(aes(x=job, fill = default), position = "fill") + labs(y = "Percent")
g2 = ggplot(bank) + geom_bar(aes(x=marital, fill = default), position = "fill") + labs(y = "Percent")
g3 = ggplot(bank) + geom_bar(aes(x=education, fill = default), position = "fill") + labs(y = "Percent")
g4 = ggplot(bank) + geom_bar(aes(x=default, fill = default), position = "fill") + labs(y = "Percent")
g5 = ggplot(bank) + geom_bar(aes(x=housing, fill = default), position = "fill") + labs(y = "Percent")
g6 = ggplot(bank) + geom_bar(aes(x=loan, fill = default), position = "fill") + labs(y = "Percent")
g7 = ggplot(bank) + geom_bar(aes(x=contact, fill = default), position = "fill") + labs(y = "Percent")
g8 = ggplot(bank) + geom_bar(aes(x=month, fill = default), position = "fill") + labs(y = "Percent")
g9 = ggplot(bank) + geom_bar(aes(x=poutcome, fill = default), position = "fill") + labs(y = "Percent")
grid.arrange(g1, g2, g3, nrow=3, top = "default by Category")
grid.arrange(g5, g6, g7, nrow=3, top = "default by Category")
grid.arrange(g8, g9, nrow=3, top = "default by Category")
#housing by Category
g1 = ggplot(bank) + geom_bar(aes(x=job, fill = housing), position = "fill") + labs(y = "Percent")
g2 = ggplot(bank) + geom_bar(aes(x=marital, fill = housing), position = "fill") + labs(y = "Percent")
g3 = ggplot(bank) + geom_bar(aes(x=education, fill = housing), position = "fill") + labs(y = "Percent")
g4 = ggplot(bank) + geom_bar(aes(x=default, fill = housing), position = "fill") + labs(y = "Percent")
g5 = ggplot(bank) + geom_bar(aes(x=housing, fill = housing), position = "fill") + labs(y = "Percent")
g6 = ggplot(bank) + geom_bar(aes(x=loan, fill = housing), position = "fill") + labs(y = "Percent")
g7 = ggplot(bank) + geom_bar(aes(x=contact, fill = housing), position = "fill") + labs(y = "Percent")
g8 = ggplot(bank) + geom_bar(aes(x=month, fill = housing), position = "fill") + labs(y = "Percent")
g9 = ggplot(bank) + geom_bar(aes(x=poutcome, fill = housing), position = "fill") + labs(y = "Percent")
grid.arrange(g1, g2, g3, nrow=3, top = "housing by Category")
grid.arrange(g4, g6, g7, nrow=3, top = "housing by Category")
grid.arrange(g8, g9, nrow=3, top = "housing by Category")
#loan by Category
g1 = ggplot(bank) + geom_bar(aes(x=job, fill = loan), position = "fill") + labs(y = "Percent")
g2 = ggplot(bank) + geom_bar(aes(x=marital, fill = loan), position = "fill") + labs(y = "Percent")
g3 = ggplot(bank) + geom_bar(aes(x=education, fill = loan), position = "fill") + labs(y = "Percent")
g4 = ggplot(bank) + geom_bar(aes(x=default, fill = loan), position = "fill") + labs(y = "Percent")
g5 = ggplot(bank) + geom_bar(aes(x=housing, fill = loan), position = "fill") + labs(y = "Percent")
g6 = ggplot(bank) + geom_bar(aes(x=loan, fill = loan), position = "fill") + labs(y = "Percent")
g7 = ggplot(bank) + geom_bar(aes(x=contact, fill = loan), position = "fill") + labs(y = "Percent")
g8 = ggplot(bank) + geom_bar(aes(x=month, fill = loan), position = "fill") + labs(y = "Percent")
g9 = ggplot(bank) + geom_bar(aes(x=poutcome, fill = loan), position = "fill") + labs(y = "Percent")
grid.arrange(g1, g2, g3, nrow=3, top = "loan by Category")
grid.arrange(g4, g5, g7, nrow=3, top = "loan by Category")
grid.arrange(g8, g9, nrow=3, top = "loan by Category")
#contact by Category
g1 = ggplot(bank) + geom_bar(aes(x=job, fill = contact), position = "fill") + labs(y = "Percent")
g2 = ggplot(bank) + geom_bar(aes(x=marital, fill = contact), position = "fill") + labs(y = "Percent")
g3 = ggplot(bank) + geom_bar(aes(x=education, fill = contact), position = "fill") + labs(y = "Percent")
g4 = ggplot(bank) + geom_bar(aes(x=default, fill = contact), position = "fill") + labs(y = "Percent")
g5 = ggplot(bank) + geom_bar(aes(x=housing, fill = contact), position = "fill") + labs(y = "Percent")
g6 = ggplot(bank) + geom_bar(aes(x=loan, fill = contact), position = "fill") + labs(y = "Percent")
g7 = ggplot(bank) + geom_bar(aes(x=contact, fill = contact), position = "fill") + labs(y = "Percent")
g8 = ggplot(bank) + geom_bar(aes(x=month, fill = contact), position = "fill") + labs(y = "Percent")
g9 = ggplot(bank) + geom_bar(aes(x=poutcome, fill = contact), position = "fill") + labs(y = "Percent")
grid.arrange(g1, g2, g3, nrow=3, top = "contact by Category")
grid.arrange(g4, g5, g6, nrow=3, top = "contact by Category")
grid.arrange(g8, g9, nrow=3, top = "contact by Category")
#month by Category
g1 = ggplot(bank) + geom_bar(aes(x=job, fill = month), position = "fill") + labs(y = "Percent")
g2 = ggplot(bank) + geom_bar(aes(x=marital, fill = month), position = "fill") + labs(y = "Percent")
g3 = ggplot(bank) + geom_bar(aes(x=education, fill = month), position = "fill") + labs(y = "Percent")
g4 = ggplot(bank) + geom_bar(aes(x=default, fill = month), position = "fill") + labs(y = "Percent")
g5 = ggplot(bank) + geom_bar(aes(x=housing, fill = month), position = "fill") + labs(y = "Percent")
g6 = ggplot(bank) + geom_bar(aes(x=loan, fill = month), position = "fill") + labs(y = "Percent")
g7 = ggplot(bank) + geom_bar(aes(x=contact, fill = month), position = "fill") + labs(y = "Percent")
g8 = ggplot(bank) + geom_bar(aes(x=month, fill = month), position = "fill") + labs(y = "Percent")
g9 = ggplot(bank) + geom_bar(aes(x=poutcome, fill = month), position = "fill") + labs(y = "Percent")
grid.arrange(g1, g2, g3, nrow=3, top = "month by Category")
grid.arrange(g4, g5, g6, nrow=3, top = "month by Category")
grid.arrange(g7, g9, nrow=3, top = "month by Category")
#poutcome by Category
g1 = ggplot(bank) + geom_bar(aes(x=job, fill = poutcome), position = "fill") + labs(y = "Percent")
g2 = ggplot(bank) + geom_bar(aes(x=marital, fill = poutcome), position = "fill") + labs(y = "Percent")
g3 = ggplot(bank) + geom_bar(aes(x=education, fill = poutcome), position = "fill") + labs(y = "Percent")
g4 = ggplot(bank) + geom_bar(aes(x=default, fill = poutcome), position = "fill") + labs(y = "Percent")
g5 = ggplot(bank) + geom_bar(aes(x=housing, fill = poutcome), position = "fill") + labs(y = "Percent")
g6 = ggplot(bank) + geom_bar(aes(x=loan, fill = poutcome), position = "fill") + labs(y = "Percent")
g7 = ggplot(bank) + geom_bar(aes(x=contact, fill = poutcome), position = "fill") + labs(y = "Percent")
g8 = ggplot(bank) + geom_bar(aes(x=month, fill = poutcome), position = "fill") + labs(y = "Percent")
g9 = ggplot(bank) + geom_bar(aes(x=poutcome, fill = poutcome), position = "fill") + labs(y = "Percent")
grid.arrange(g1, g2, g3, nrow=3, top = "poutcome by Category")
grid.arrange(g4, g5, g6, nrow=3, top = "poutcome by Category")
grid.arrange(g7, g8, nrow=3, top = "poutcome by Category")
#age by Category
cm10 = ggplot(bank) + geom_boxplot(aes(x=job, y = age)) + theme(axis.title.y = element_blank())
cm11 = ggplot(bank) + geom_boxplot(aes(x=marital, y = age)) + theme(axis.title.y = element_blank())
cm12 = ggplot(bank) + geom_boxplot(aes(x=education, y = age)) + theme(axis.title.y = element_blank())
cm13 = ggplot(bank) + geom_boxplot(aes(x=default, y = age)) + theme(axis.title.y = element_blank())
cm14 = ggplot(bank) + geom_boxplot(aes(x=housing, y = age)) + theme(axis.title.y = element_blank())
cm15 = ggplot(bank) + geom_boxplot(aes(x=loan, y = age)) + theme(axis.title.y = element_blank())
cm16 = ggplot(bank) + geom_boxplot(aes(x=contact, y = age)) + theme(axis.title.y = element_blank())
cm17 = ggplot(bank) + geom_boxplot(aes(x=month, y = age)) + theme(axis.title.y = element_blank())
cm18 = ggplot(bank) + geom_boxplot(aes(x=poutcome, y = age)) + theme(axis.title.y = element_blank())
grid.arrange(cm10,cm11,cm12,cm13,cm14,cm15,cm16,cm17,cm18, nrow=3, top = "age by Category")
#balance by Category
cm20 = ggplot(bank) + geom_boxplot(aes(x=job, y = balance)) + theme(axis.title.y = element_blank())
cm21 = ggplot(bank) + geom_boxplot(aes(x=marital, y = balance)) + theme(axis.title.y = element_blank())
cm22 = ggplot(bank) + geom_boxplot(aes(x=education, y = balance)) + theme(axis.title.y = element_blank())
cm23 = ggplot(bank) + geom_boxplot(aes(x=default, y = balance)) + theme(axis.title.y = element_blank())
cm24 = ggplot(bank) + geom_boxplot(aes(x=housing, y = balance)) + theme(axis.title.y = element_blank())
cm25 = ggplot(bank) + geom_boxplot(aes(x=loan, y = balance)) + theme(axis.title.y = element_blank())
cm26 = ggplot(bank) + geom_boxplot(aes(x=contact, y = balance)) + theme(axis.title.y = element_blank())
cm27 = ggplot(bank) + geom_boxplot(aes(x=month, y = balance)) + theme(axis.title.y = element_blank())
cm28 = ggplot(bank) + geom_boxplot(aes(x=poutcome, y = balance)) + theme(axis.title.y = element_blank())
grid.arrange(cm20,cm21,cm22,cm23,cm24,cm25,cm26,cm27,cm28, nrow=3, top = "balance by Category")
#duration by Category
cm30 = ggplot(bank) + geom_boxplot(aes(x=job, y = duration)) + theme(axis.title.y = element_blank())
cm31 = ggplot(bank) + geom_boxplot(aes(x=marital, y = duration)) + theme(axis.title.y = element_blank())
cm32 = ggplot(bank) + geom_boxplot(aes(x=education, y = duration)) + theme(axis.title.y = element_blank())
cm33 = ggplot(bank) + geom_boxplot(aes(x=default, y = duration)) + theme(axis.title.y = element_blank())
cm34 = ggplot(bank) + geom_boxplot(aes(x=housing, y = duration)) + theme(axis.title.y = element_blank())
cm35 = ggplot(bank) + geom_boxplot(aes(x=loan, y = duration)) + theme(axis.title.y = element_blank())
cm36 = ggplot(bank) + geom_boxplot(aes(x=contact, y = duration)) + theme(axis.title.y = element_blank())
cm37 = ggplot(bank) + geom_boxplot(aes(x=month, y = duration)) + theme(axis.title.y = element_blank())
cm38 = ggplot(bank) + geom_boxplot(aes(x=poutcome, y = duration)) + theme(axis.title.y = element_blank())
grid.arrange(cm30,cm31,cm32,cm33,cm34,cm35,cm36,cm37,cm38, nrow=3, top = "duration by Category")
#campaign by Category
cm40 = ggplot(bank) + geom_boxplot(aes(x=job, y = campaign)) + theme(axis.title.y = element_blank())
cm41 = ggplot(bank) + geom_boxplot(aes(x=marital, y = campaign)) + theme(axis.title.y = element_blank())
cm42 = ggplot(bank) + geom_boxplot(aes(x=education, y = campaign)) + theme(axis.title.y = element_blank())
cm43 = ggplot(bank) + geom_boxplot(aes(x=default, y = campaign)) + theme(axis.title.y = element_blank())
cm44 = ggplot(bank) + geom_boxplot(aes(x=housing, y = campaign)) + theme(axis.title.y = element_blank())
cm45 = ggplot(bank) + geom_boxplot(aes(x=loan, y = campaign)) + theme(axis.title.y = element_blank())
cm46 = ggplot(bank) + geom_boxplot(aes(x=contact, y = campaign)) + theme(axis.title.y = element_blank())
cm47 = ggplot(bank) + geom_boxplot(aes(x=month, y = campaign)) + theme(axis.title.y = element_blank())
cm48 = ggplot(bank) + geom_boxplot(aes(x=poutcome, y = campaign)) + theme(axis.title.y = element_blank())
grid.arrange(cm40,cm41,cm42,cm43,cm44,cm45,cm46,cm47,cm48, nrow=3, top = "campaign by Category")
#pdays by Category
cm50 = ggplot(bank) + geom_boxplot(aes(x=job, y = pdays)) + theme(axis.title.y = element_blank())
cm51 = ggplot(bank) + geom_boxplot(aes(x=marital, y = pdays)) + theme(axis.title.y = element_blank())
cm52 = ggplot(bank) + geom_boxplot(aes(x=education, y = pdays)) + theme(axis.title.y = element_blank())
cm53 = ggplot(bank) + geom_boxplot(aes(x=default, y = pdays)) + theme(axis.title.y = element_blank())
cm54 = ggplot(bank) + geom_boxplot(aes(x=housing, y = pdays)) + theme(axis.title.y = element_blank())
cm55 = ggplot(bank) + geom_boxplot(aes(x=loan, y = pdays)) + theme(axis.title.y = element_blank())
cm56 = ggplot(bank) + geom_boxplot(aes(x=contact, y = pdays)) + theme(axis.title.y = element_blank())
cm57 = ggplot(bank) + geom_boxplot(aes(x=month, y = pdays)) + theme(axis.title.y = element_blank())
cm58 = ggplot(bank) + geom_boxplot(aes(x=poutcome, y = pdays)) + theme(axis.title.y = element_blank())
grid.arrange(cm50,cm51,cm52,cm53,cm54,cm55,cm56,cm57,cm58, nrow=3, top = "pdays by Category")
#previous by Category
cm60 = ggplot(bank) + geom_boxplot(aes(x=job, y = previous)) + theme(axis.title.y = element_blank())
cm61 = ggplot(bank) + geom_boxplot(aes(x=marital, y = previous)) + theme(axis.title.y = element_blank())
cm62 = ggplot(bank) + geom_boxplot(aes(x=education, y = previous)) + theme(axis.title.y = element_blank())
cm63 = ggplot(bank) + geom_boxplot(aes(x=default, y = previous)) + theme(axis.title.y = element_blank())
cm64 = ggplot(bank) + geom_boxplot(aes(x=housing, y = previous)) + theme(axis.title.y = element_blank())
cm65 = ggplot(bank) + geom_boxplot(aes(x=loan, y = previous)) + theme(axis.title.y = element_blank())
cm66 = ggplot(bank) + geom_boxplot(aes(x=contact, y = previous)) + theme(axis.title.y = element_blank())
cm67 = ggplot(bank) + geom_boxplot(aes(x=month, y = previous)) + theme(axis.title.y = element_blank())
cm68 = ggplot(bank) + geom_boxplot(aes(x=poutcome, y = previous)) + theme(axis.title.y = element_blank())
grid.arrange(cm60,cm61,cm62,cm63,cm64,cm65,cm66,cm67,cm68, nrow=3, top = "previous by Category")
#day by Category
cm70 = ggplot(bank) + geom_boxplot(aes(x=job, y = day)) + theme(axis.title.y = element_blank())
cm71 = ggplot(bank) + geom_boxplot(aes(x=marital, y = day)) + theme(axis.title.y = element_blank())
cm72 = ggplot(bank) + geom_boxplot(aes(x=education, y = day)) + theme(axis.title.y = element_blank())
cm73 = ggplot(bank) + geom_boxplot(aes(x=default, y = day)) + theme(axis.title.y = element_blank())
cm74 = ggplot(bank) + geom_boxplot(aes(x=housing, y = day)) + theme(axis.title.y = element_blank())
cm75 = ggplot(bank) + geom_boxplot(aes(x=loan, y = day)) + theme(axis.title.y = element_blank())
cm76 = ggplot(bank) + geom_boxplot(aes(x=contact, y = day)) + theme(axis.title.y = element_blank())
cm77 = ggplot(bank) + geom_boxplot(aes(x=month, y = day)) + theme(axis.title.y = element_blank())
cm78 = ggplot(bank) + geom_boxplot(aes(x=poutcome, y = day)) + theme(axis.title.y = element_blank())
grid.arrange(cm70,cm71,cm72,cm73,cm74,cm75,cm76,cm77,cm78, nrow=3, top = "day by Category")
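The filled bar charts above repeat one pattern for every pair of categorical variables, so the same plots could probably be generated in a loop. The block below is only a minimal sketch of that idea, not the code used in this report: `fill_plots()`, `arrange_pages()` and the toy data frame `bank_demo` are hypothetical names introduced for illustration, and `bank_demo` holds made-up rows so the example runs on its own.

```r
library(ggplot2)
library(gridExtra)

# Toy stand-in for `bank` (assumption) so the sketch runs on its own.
bank_demo <- data.frame(
  job     = c("admin.", "services", "admin.", "technician", "services", "technician"),
  marital = c("married", "single", "single", "married", "divorced", "married"),
  housing = c("yes", "no", "yes", "yes", "no", "no"),
  loan    = c("no", "no", "yes", "no", "yes", "no")
)

# Hypothetical helper: one filled bar chart of `fill_var` per other categorical column.
fill_plots <- function(data, fill_var, cat_vars) {
  x_vars <- setdiff(cat_vars, fill_var)  # skip the variable plotted against itself
  lapply(x_vars, function(x_var) {
    ggplot(data) +
      geom_bar(aes(x = .data[[x_var]], fill = .data[[fill_var]]), position = "fill") +
      labs(x = x_var, y = "Percent", fill = fill_var)
  })
}

# Hypothetical helper: draw the plots three to a page, as in the calls above.
arrange_pages <- function(plots, title, per_page = 3) {
  pages <- split(plots, ceiling(seq_along(plots) / per_page))
  for (page in pages) {
    grid.arrange(grobs = page, nrow = per_page, top = title)
  }
  invisible(length(pages))  # number of pages drawn
}

cat_vars <- names(bank_demo)
for (v in cat_vars) {
  arrange_pages(fill_plots(bank_demo, v, cat_vars), paste(v, "by Category"))
}
```

With the real data, `bank_demo` would be replaced by `bank` and `cat_vars` by the nine categorical columns (job, marital, education, default, housing, loan, contact, month, poutcome), which should give the same three pages per variable as the hand-written calls.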